[SPARK-45846][SQL] optimizeNullAwareAntiJoin should respect autoBroadcastJoinThreshold #53670
callmepandey wants to merge 1 commit into apache:master
Conversation
JIRA Issue Information: === Improvement SPARK-45846 ===
This comment was automatically generated by GitHub Actions.
Hi @cloud-fan, I've opened this PR to fix an issue where null-aware anti-joins (from SPARK-32290) are not respecting the `spark.sql.autoBroadcastJoinThreshold` configuration.

The PR is rebased on the latest master and CI is passing. Would greatly appreciate your review since you signed off on the original SPARK-32290 implementation. Thank you!
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
This PR fixes an issue where null-aware anti-joins (enabled via `spark.sql.optimizeNullAwareAntiJoin`) were unconditionally using `BroadcastHashJoinExec` without checking whether the right side was small enough to broadcast according to `spark.sql.autoBroadcastJoinThreshold`.

Why are the changes needed?
When `spark.sql.optimizeNullAwareAntiJoin` is enabled, queries using `NOT IN` with a subquery would always attempt to broadcast the right side, even when it exceeded the broadcast threshold. This could lead to OOM errors with large datasets.

Does this PR introduce any user-facing change?
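For illustration, a query shape that hits this path might look like the following. This is a sketch assuming a running `SparkSession`; the table and column names are hypothetical and not from the PR:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only; assumes a running SparkSession and hypothetical tables.
val spark = SparkSession.builder().appName("naaj-example").getOrCreate()

// With spark.sql.optimizeNullAwareAntiJoin enabled, a single-column
// NOT IN subquery like this is planned as a null-aware anti-join.
// Before this fix, the subquery side was always broadcast, regardless
// of spark.sql.autoBroadcastJoinThreshold.
val df = spark.sql(
  """SELECT * FROM orders
    |WHERE customer_id NOT IN (SELECT id FROM blocked_customers)""".stripMargin)

// Inspect the physical plan: look for BroadcastHashJoinExec
// with the NullAwareAntiJoin marker.
df.explain()
```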
Yes. When `spark.sql.autoBroadcastJoinThreshold` is set to -1 (or a small value), null-aware anti-joins will now respect this configuration and fall back to `BroadcastNestedLoopJoinExec` instead of attempting to broadcast large tables with `BroadcastHashJoinExec`.

Join Strategy Selection:
- Before the fix: `BroadcastHashJoinExec` with `isNullAwareAntiJoin=true` (optimized O(M) hash lookup, but risk of OOM)
- After the fix, right side within threshold: `BroadcastHashJoinExec` with `isNullAwareAntiJoin=true` (optimized)
- After the fix, right side exceeds threshold: `BroadcastNestedLoopJoinExec` (slower O(M*N), but avoids OOM)

How was this patch tested?
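The selection above boils down to a size check. A minimal sketch of that decision, written as a plain helper (illustrative only; the real change lives in Spark's planner and reads sizes from plan statistics, not from a function like this):

```scala
// Illustrative sketch, not the actual SparkStrategies code.
// Chooses the join operator for a null-aware anti-join based on the
// estimated size of the right (build) side.
def chooseNullAwareAntiJoin(
    rightSizeInBytes: BigInt,
    autoBroadcastJoinThreshold: Long): String = {
  // A threshold of -1 disables broadcasting entirely.
  val canBroadcast =
    autoBroadcastJoinThreshold >= 0 &&
      rightSizeInBytes <= BigInt(autoBroadcastJoinThreshold)
  if (canBroadcast) "BroadcastHashJoinExec(isNullAwareAntiJoin = true)"
  else "BroadcastNestedLoopJoinExec"
}
```

With the default threshold (10 MB) a small subquery side still takes the optimized hash-join path; with -1 the planner always falls back to the nested-loop variant.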
Added a new test case "SPARK-45846: optimizeNullAwareAntiJoin should respect autoBroadcastJoinThreshold" in `JoinSuite` that verifies null-aware anti-joins do not use `BroadcastHashJoinExec` when broadcast is disabled.
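The described test plausibly takes a shape like the following. This is a sketch of the kind of assertion a `JoinSuite` test makes, not the exact code from the PR; the config constant names and test tables are assumptions:

```scala
// Sketch only; assumes Spark's SQL test harness (withSQLConf, sql, testData).
test("SPARK-45846: optimizeNullAwareAntiJoin should respect autoBroadcastJoinThreshold") {
  withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
    val df = sql("SELECT * FROM testData WHERE key NOT IN (SELECT a FROM testData2)")
    val plan = df.queryExecution.executedPlan
    // With broadcast disabled, no null-aware BroadcastHashJoinExec should appear.
    assert(plan.collect {
      case b: BroadcastHashJoinExec if b.isNullAwareAntiJoin => b
    }.isEmpty)
  }
}
```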